MH-5061 



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



Applicant: Kadir A. Peker 
Ajay Divakaran
Huifang Sun 



Serial No. : 
Filed: Herewith 



Group Art Unit: 
Examiner : 




Title: ADAPTIVELY PROCESSING A VIDEO BASED ON CONTENT 
CHARACTERISTICS OF FRAMES IN THE VIDEO 



"EXPRESS MAIL" mailing label number BL031856685US 
Date of Deposit: November 17, 2000



I hereby certify that this correspondence is being deposited with the 
United States Postal Service "EXPRESS MAIL Post Office to Addressee" 
service under 37 CFR 1.10 on the date indicated above and is addressed to:
Assistant Commissioner for Patents, Washington, D.C. 20231.

Kelli J. Withrow 



Printed name of depositor



Signature of depositor 



TRANSMITTAL OF FILING UNDER 37 CFR 1.53(b) 



Assistant Commissioner for Patents 
Washington, DC 20231 

ATTN: Box Patent Application 



Sir: 

This is a request for filing a Continuation-in-Part (CIP) application under
37 CFR §1.53(b) of pending prior application Serial No. 09/634,364, filed on
08/09/00, entitled METHODS FOR SUMMARIZING A VIDEO USING MOTION AND COLOR
DESCRIPTORS.



PLG - 10/94 
CON-XTML . FRM 






1. Fee Calculation (37 CFR 1.16)

CLAIMS AS FILED

                                   Number filed   Number extra   Rate          Calculations
Total Claims (37 CFR 1.16(c))      27 - 20 =      7              x $ 18.00     $ 126
Independent Claims                 2 - 3 =                       x $ 78.00     $
  (37 CFR 1.16(b))
Multiple Dependent Claim(s),                                     + $ 260.00    $
  if any (37 CFR 1.16(d))
Basic Fee                                                        + $ 710.00
Total of above Calculations =                                                  $
Assignment                                                                     $ 40.00
TOTAL                                                                          $ 876.00



2. Inventorship Statement 

With respect to the prior copending U.S. application from which
this application claims benefit under 35 U.S.C. 120, the
inventor(s) in this application are the same.



3. Assignment

[X] an assignment of the invention to Mitsubishi Electric 
Research Laboratories, Inc. is attached. A separate 
"ASSIGNMENT COVER LETTER ACCOMPANYING NEW PATENT 
APPLICATION" is also attached. 



4. Fee Payment Being Made At This Time 

[] Enclosed 

[] basic filing fee $

[] recording assignment
($40.00; 37 CFR 1.21(h)) $

Total fees enclosed $



[X] charge Account No. 50-0749 in the amount of $876.00. A
duplicate of this request is attached.






5. Authorization To Charge Additional Fees 

[X] The Assistant Commissioner for Patents is hereby
authorized to charge the following additional fees which
may be required by this paper and during the entire
pendency of the application to Account No. 50-0749

[] 37 CFR 1.16(a), (f) or (g) (filing fees) 
[] 37 CFR 1.16(b), (c) and (d) (presentation of extra 
fees) 

6. Power of Attorney

[X] The power of attorney in the prior application is to
Dirk Brinkman, Reg. No. 35,460.

[X] All future correspondence should be addressed to: 
Patent Department 

Mitsubishi Electric Research Laboratories, Inc. 
201 Broadway, 8th Floor
Cambridge, MA 02139



Respectfully submitted, 

Mitsubishi Electric Research Laboratories, Inc. 




Reg. No. 35,460 
Attorney for Assignee 



Patent Department 

Mitsubishi Electric Research Laboratories, Inc. 
201 Broadway, 8th Floor
Cambridge, MA 02139 
(617) 621-7539 






EXPRESS MAIL number: EL031856685US
Date of Deposit: November 17, 2000

I hereby certify that this paper is being deposited with the United States
Postal Service "EXPRESS MAIL Post Office to Addressee" service under 37 CFR
1.10 on the date indicated above and is addressed to the Assistant
Commissioner for Patents, Washington, DC 20231.



Typed name of person mailing paper or fee 



APPLICATION FOR UNITED STATES LETTERS PATENT 



Kelli J. Withrow





Signature 



Title: 



ADAPTIVELY PROCESSING A VIDEO BASED ON CONTENT 
CHARACTERISTICS OF FRAMES IN A VIDEO 



Inventors : 



Kadir A. Peker 
Ajay Divakaran
Huifang Sun 



MH-5061 
Peker et al. 



Adaptively Processing a Video Based on Content Characteristics of
Frames in the Video

Cross-Reference to Related Application 

This is a continuation-in-part of U.S. Patent Application Sn. 09/654,364 filed 
August 9, 2000 by Divakaran et al. 

Field of the Invention

This invention relates generally to processing videos, and more particularly to
adaptively processing videos based on characteristics of content of frames of
the video.

Background of the Invention 

Standard Processing Techniques

Basic standards for processing a video encoded as a digital signal have been
adopted by the Moving Picture Experts Group (MPEG). The MPEG standards
achieve high data compression rates by developing information for full frames
of the video only every so often. The full frames, i.e., intra-coded frames, are
often referred to as "I-frames" or "reference frames," and contain full frame
information independent of any other frames. Image difference frames, i.e.,
inter-coded frames, are often referred to as "B-frames" and "P-frames," or as
"predictive frames," and are encoded between the I-frames and reflect only
image differences, i.e., residues, with respect to the reference frame.

Typically during the processing, each frame of a video is partitioned into
smaller blocks of picture element, i.e., pixel, data. Each block is subjected to a
discrete cosine transformation (DCT) function to convert the statistically
dependent spatial domain pixels into independent frequency domain DCT
coefficients. Respective 8x8 or 16x16 blocks of pixels, referred to as
"macro-blocks," are subjected to the DCT function to provide the encoded
signal. The DCT coefficients are usually energy concentrated so that only a few
of the coefficients in a macro-block contain the main part of the picture
information. For example, if a macro-block contains an edge boundary of an
object, then the energy in that block, after transformation, as represented by
the DCT coefficients, includes a relatively large DC coefficient and randomly
distributed AC coefficients throughout the matrix of coefficients.

A non-edge macro-block, on the other hand, is usually characterized by a
similarly large DC coefficient and a few adjacent AC coefficients which are
substantially larger than other coefficients associated with that block. The DCT
coefficients are typically subjected to adaptive quantization, and then are
run-length and variable-length encoded. Thus, the macro-blocks of transmitted
data typically include fewer than an 8 x 8 matrix of code words.

The macro-blocks of inter-coded frame data, i.e., encoded P- or B-frame data,
include DCT coefficients which represent only the differences between
predicted pixels and actual pixels in the macro-block. Macro-blocks of
intra-coded and inter-coded frame data also include information such as the
level of quantization employed, a macro-block address or location indicator,
and a macro-block type. The latter information is often referred to as "header"
or "overhead" information. This provides good spatial compression of the video.

Each P-frame is predicted from the most recently occurring I- or P-frame. Each
B-frame is predicted from the I- or P-frames between which the B-frame is
disposed. The predictive coding process involves generating displacement
vectors, often referred to as "motion vectors," which indicate a magnitude of
the displacement of the macro-block of an I-frame that most closely matches
the macro-block of the B- or P-frame currently being coded. The pixel data of
the matched block in the I-frame are subtracted, on a pixel-by-pixel basis, from
the block of the P- or B-frame being encoded, to develop the residues. The
transformed residues and the vectors form part of the encoded data for the P-
and B-frames. This provides good temporal compression.
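
The predictive coding step described above can be illustrated with a minimal Python sketch. It is not an MPEG encoder: the exhaustive search window, the sum-of-absolute-differences cost, and the names `best_motion_vector`, `reference`, and `current` are assumptions chosen only for illustration.

```python
def best_motion_vector(reference, current, top, left, size, search):
    """Find the displacement (dx, dy) into `reference` that best matches the
    size x size block of `current` whose top-left corner is (top, left).

    Frames are lists of rows of integer pixel values. The cost function
    (sum of absolute differences) and the exhaustive search are assumptions.
    Returns ((dx, dy), residue), where residue is the pixel-by-pixel
    difference between the current block and the matched reference block.
    """
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block falls outside the reference frame
            cost = sum(
                abs(current[top + r][left + c] - reference[y + r][x + c])
                for r in range(size)
                for c in range(size)
            )
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    _, dx, dy = best
    # Residue: current block minus the matched reference block.
    residue = [
        [current[top + r][left + c] - reference[top + dy + r][left + dx + c]
         for c in range(size)]
        for r in range(size)
    ]
    return (dx, dy), residue
```

When the matched block is identical to the block being coded, the residue is all zeros, so only the motion vector carries information for that block.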

Video Analysis 

Video analysis can be defined as processing a video with the intention of
understanding the content of the video. The understanding of the video can
range from a "low-level" syntactic understanding, such as detecting segment
boundaries or scene changes in the video, to a "high-level" semantic
understanding, such as detecting a genre of the video. The low-level
understanding can be achieved by analyzing low-level features, such as color,
motion, texture, shape, and the like, to generate content descriptions. The
content description can then be used to index the video. The high-level
understanding can be encoded at the source, or in some instances derived from
low-level features, see Yeo et al. "Rapid scene analysis on compressed videos,"
IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, pp.
533-544, 1995, Meng et al. "CVEPS: A compressed video editing and parsing
system," ACM Multimedia Conference, 1996, and Chang et al. "Compressed-domain
techniques for image/video indexing and manipulation," IEEE International
Conference on Image Processing, Volume I, pp. 314-317, 1995.

Video Summarization 
Video summarization can be defined as a process that produces a compact
representation of a video that still conveys the semantic essence of the video.
The compact representation can include key frames or key segments, or a
combination of key frames and segments. As an example, a video summary of a
tennis match can include a small key segment and a key frame. The key
segment captures both of the players in action during the very last winning
return, and the key frame captures the winner with the trophy. A more detailed
and longer summary could include all frames of the match game or point. While
it is certainly possible to generate such a summary manually, this is tedious and
costly.

Automatic video summarization methods are well known, see S. Pfeiffer et al.
in "Abstracting Digital Movies Automatically," J. Visual Comm. Image
Representation, vol. 7, no. 4, pp. 345-353, December 1996, and Hanjalic et al.
in "An Integrated Scheme for Automated Video Abstraction Based on
Unsupervised Cluster-Validity Analysis," IEEE Trans. on Circuits and Systems
for Video Technology, Vol. 9, No. 8, December 1999.

Most known video summarization methods focus on color-based
summarization. Pfeiffer et al. also use motion, in combination with other
features, to generate video summaries. However, their approach merely uses a
weighted combination that overlooks possible correlation between the
combined features.

While color descriptors are robust, they do not, by definition, include the
motion characteristics of the video sequence. On the other hand, motion
descriptors tend to be less robust to noise than color descriptors and have
generally not been as widely used for summarization.

The level of motion activity in a video can be a measure of how much the scene
acquired by the video is changing. Therefore, the motion activity can be
considered a measure of the "summarizability" of the video. For instance, a
high speed car chase will certainly have many more "changes" in it compared to
a scene of a news-caster, and thus, the high speed car chase scene will require
more resources for a visual summary than would the news-caster scene.

It is desired to adaptively process a video using content characteristics of
frames in the video. During the processing, play time for the frames of the
video should be allocated on a basis of content characteristics.

Summary of the Invention 



The invention provides a system and method for temporally processing an input
video including input frames. Each frame has an associated frame play time,
and the input video has a total input video play time that is a sum of the input
frame play times of all of the input frames. Each of the input frames is
classified according to a content characteristic of each frame. An output frame
play time is allocated to each of the input frames that is based on the classified
content characteristic of each of the input frames to generate a plurality of
output frames that form an output video.

The content characteristic can be based on low-level features and/or high-level
features of each of the input frames, and the allocated play time is dynamically
varied while processing the video. The allocation can be done by sampling the
frames, or by varying the frame rate.

Brief Description of the Drawing 

Figure 1 is a block diagram of a system for adaptively processing videos
according to the invention;

Figure 2 is a block diagram of an adaptive process based on motion activity
characteristics of content of the video; and

Figure 3 is a flow diagram of a method for processing a video according to the
invention.



Detailed Description of the Preferred Embodiment

Figure 1 is a top-level view of our system and methods 100 for adaptively
processing a video based on selected characteristics 103 or features extracted
from the content of a video. An input video 101 to our system and methods 100
is a temporally ordered set of frames V(1, 2, ..., N-1, N) that comprise the video.

The system generates an output video 102 that is dependent on the selected
characteristics of the video. In one embodiment of the invention, the output
video 102 is a temporally ordered set of frames v(1, 2, ..., M-1, M) where v ∈ V.
The invention allocates play time to the frames of the video according to the
measured characteristics.

As an advantage of the invention, the amount of play time that is allocated to
any selected frame can span a continuum: from no time (the frame is not played
at all), through a short time (the frame is sped up), a normal play time, and a
long time (the frame is slowed down), to the length of time of the output video,
in which case a single frame represents the entire input video.

Our invention can dynamically process the video while the video is played. In
other words, the user determines how much time to allocate to each portion of
the video in real-time. Alternatively, the output video can be generated for later
playing.

The selected characteristics can be based on low-level (syntactic) features, or
high-level (semantic) features, or combinations of various high- and low-level
features.



Low-level features can include color, texture, brightness, contrast, spectral
parameters, local and global motion, activity, trajectory and its parameters,
speed, acceleration, object shape, object size, number of objects, number of
faces, pitch, volume, tempo, to name some examples. High-level features can
include genre, dramatic intensity, humor content, action level, beauty, lyricism,
musical intensity, educational quality, profundity, nudity, linguistic class and so
forth, see Divakaran et al. "Report on Validation Experiment on Ordered
Relation Graphs," ISO/IEC JTC1/SC29/WG11/MPEG99/M5365, December
1999.

Adaptive Sampling

Figure 2 shows how the invention operates when the measured characteristic is
motion activity 203. The line 210 represents the relative motion activity in the
input video 201 over time. When the line 210 is substantially flat, the relative
intensity of the motion activity is low, e.g., during frames 211. When the line
fluctuates rapidly, the relative intensity of the motion activity is high, e.g.,
during frames 212. The desired output video 202 has a predetermined constant
level of motion activity as represented by line 220. In other words, the user of
the system has determined that the video should be viewed at some constant
level of motion activity. It should be understood that different users can select
different levels of activity at which they desire to view the video. For example,
a viewer that is familiar with the content can view and absorb the video at a
much higher rate than someone who is totally unfamiliar with the content.

Therefore, the system 100 samples frames 211 at a higher rate, and frames 212
are sampled at a lower rate. In other words, the sampling rate (down-sampling
or up-sampling) is adaptive to the measured level of motion activity. Low-level
activities are sped up, and high-level activities are sampled at a normal rate or
slowed down. In fact, if the level of motion activity is too high to enable normal
perception, then the frames 212 can be up-sampled. For example, a one second
sequence of thirty frames can be expanded to a ten second sequence of
three-hundred frames by showing each frame ten times.
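
As a rough illustration of adaptive sampling toward a constant level of motion activity, the following Python sketch gives each input frame a number of output frames proportional to its activity divided by the target level. The function name `adaptive_sample`, the per-frame activity list, and the fractional "credit" accumulator are assumptions made for this sketch, not details taken from the text.

```python
def adaptive_sample(activities, target):
    """Return the indices of the input frames to show, in output order.

    activities: one motion-activity value per input frame (assumed input).
    target:     the desired constant level of motion activity (> 0).

    A frame whose activity is below the target earns less than one output
    frame (down-sampling, i.e., it is sped up); a frame whose activity is
    above the target earns more than one (up-sampling, i.e., slowed down).
    """
    if target <= 0:
        raise ValueError("target must be positive")
    output = []
    credit = 0.0  # fractional output frames owed so far
    for index, activity in enumerate(activities):
        credit += activity / target
        while credit >= 1.0:
            output.append(index)  # show (or repeat) this input frame
            credit -= 1.0
    return output
```

For example, thirty frames whose activity is ten times the target are each repeated ten times, which reproduces the three-hundred frame expansion mentioned above.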

As a refinement, the additional frames can be interpolated from one frame to
the next to smooth the motion of the up-sampled frames. If the video is in
MPEG format, then the interpolation can be done by generating additional
intra-frames. In this case, it will appear as if the video is played in slow motion.
In any case, the sampling rate determines how much play time is allocated, on a
continuum, on a per frame basis.

In an alternative embodiment, the level of motion activity is adaptively altered
by changing the frame rate. Increasing the frame rate decreases the amount of
play time that is allocated to each of the frames. An increased frame rate results
in faster movement, i.e., the faster the frame rate, the faster the objects in the
video appear to move, and therefore the larger the motion vectors. Decreasing
the frame rate has the opposite effect. Therefore, the frame rate varies with the
level of motion activity.

In some sense, sampling can be considered an extreme variation on changing the
frame rate. If the frame rate is increased, then the play time of each of the
frames is decreased. Thus, if the instantaneous frame rate is infinite, then the
play time is decreased to zero, and the frame is, in effect, deselected or not
sampled. Likewise, as the frame's play time is increased, the instantaneous
frame rate is decreased. Thus, if the frame rate is decreased to a very low
number, say one frame per ten seconds, then the video is reduced to a sequence
of one or more stills.
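
The relationship between play time and instantaneous frame rate can be sketched in Python as follows. The function names, the `base_frame_rate` parameter, and the rule that play time scales with activity divided by the target level are assumptions used only to make the idea concrete.

```python
import math


def frame_play_times(activities, target, base_frame_rate):
    """Allocate a play time (in seconds) to every input frame.

    Each frame normally plays for 1 / base_frame_rate seconds. Assumed rule:
    the play time is scaled by activity / target, so low-activity frames get
    less time (higher instantaneous frame rate) and high-activity frames get
    more time (lower instantaneous frame rate).
    """
    if target <= 0 or base_frame_rate <= 0:
        raise ValueError("target and base_frame_rate must be positive")
    base_time = 1.0 / base_frame_rate
    return [base_time * activity / target for activity in activities]


def instantaneous_frame_rate(play_time):
    """Frame rate implied by a play time; zero play time means the frame is
    effectively not sampled (infinite instantaneous frame rate)."""
    return math.inf if play_time == 0 else 1.0 / play_time
```

A frame that is allocated zero play time corresponds to the "not sampled" extreme, and a very long play time corresponds to a still.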

Measure of Motion Activity 

One measure of motion activity can be the average of the magnitude of the
motion vectors, see Peker et al. "Automatic measurement of intensity of motion
activity," Proceedings of SPIE Conference on Storage and Retrieval for Media
Databases, January 2001. However, there are many variations possible,
depending on the application. For instance, we use the average motion vector
magnitude as a measure of motion activity to favor segments with moving
regions of significant size and activity, and we use the magnitude of the shortest
motion vector as a measure of motion activity to favor segments with
significant global motion. It should be understood that other statistical moments
such as standard deviation, median, variance, skew, and kurtosis can also be
used.
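
A minimal Python sketch of these measures follows. It assumes that the motion vectors of a frame are already available as `(dx, dy)` pairs; the function name `motion_activity` and the `statistic` parameter are illustrative choices, not part of the text.

```python
import math
import statistics


def motion_activity(motion_vectors, statistic="mean"):
    """Summarize the motion-vector magnitudes of one frame.

    motion_vectors: iterable of (dx, dy) pairs, one per macro-block
                    (assumed to have been extracted beforehand).
    statistic:      "mean"   -> average magnitude,
                    "min"    -> magnitude of the shortest vector,
                    "median" or "stdev" -> other statistics of the magnitudes.
    """
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    if not magnitudes:
        return 0.0  # no vectors, e.g., an intra-coded frame
    if statistic == "mean":
        return statistics.fmean(magnitudes)
    if statistic == "min":
        return min(magnitudes)
    if statistic == "median":
        return statistics.median(magnitudes)
    if statistic == "stdev":
        return statistics.pstdev(magnitudes)
    raise ValueError(f"unknown statistic: {statistic}")
```

The mean responds to sizeable moving regions, while the minimum is only large when every block moves, which is why it can indicate global motion.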

Guaranteed Minimum Level of Motion Activity

The sampling or frame rate processing steps described above can be adapted to
provide a guaranteed minimum level of activity, as opposed to a constant level
of activity. Then, the guaranteed minimum level of activity can be used as a
"control knob" that spans the continuum from a one-frame output video to the
entire input video being the output video. In the latter case, the guaranteed
minimum level of activity is equal to the minimum activity level present in the
input video. Thus, the size of the output video can range from a single frame to
the entire input video.

The average motion vector magnitude provides a convenient linear measure of
motion activity. Decreasing the allocated play time by a factor of two, for
example, doubles the average motion vector magnitude. The average motion
vector magnitude $\bar{r}$ of the input video of $N$ frames can be expressed as:

$$\bar{r} = \frac{1}{N} \sum_{i=1}^{N} r_i,$$

where the average motion vector magnitude of frame $i$ is $r_i$.

For a target level of motion activity $r_{target}$ in the output video, the
relationship between the length $L_{output}$ of the output video and the length
$L_{input}$ of the input video can be expressed as:

$$L_{output} = \frac{\bar{r}}{r_{target}} L_{input}.$$

However, the target motion activity measure does not allow us to span the
continuum from the entire video to a one-frame output video.

Therefore, we use the guaranteed minimum activity method to achieve this
continuum. In this method, we speed up or decrease allocated play time of all
portions of the input video that are lower than the targeted minimum motion
activity $r_{target}$ so that all these portions attain the targeted motion activity
using the above formulations. The portions of the input video that exceed the
targeted motion activity can remain unchanged.

In one extreme, where the guaranteed minimum activity is equal to the
minimum motion activity in the input video, the entire input video becomes the
output video. When the guaranteed minimum activity exceeds the maximum
motion activity of the input video, the problem reduces to the above constant
activity case. In the other extreme, where the targeted level of activity is
extremely high, the output video includes only one frame of the input video as a
result of down-sampling or fast play.

The length of the output video can be determined as follows. First, classify all
of the frames of the input video into two sets. A first set $S_{higher}$ includes all
frames $j$ where the motion activity is equal to or higher than the targeted
minimum activity. The second set $S_{lower}$ includes all frames $k$ where the
motion activity is lower than the targeted motion activity. Then, the length of
the input video is expressed by:

$$L_{input} = L_{higher} + L_{lower},$$

where $L_{higher}$ and $L_{lower}$ are the lengths of the two sets. The average
motion activity of the frames $k$ that belong to the set $S_{lower}$ is

$$\bar{r}_{lower} = \frac{1}{L_{lower}} \sum_{k \in S_{lower}} r_k,$$

and the length of the output video is

$$L_{output} = \frac{\bar{r}_{lower}}{r_{target}} L_{lower} + L_{higher}.$$

It is now apparent that the guaranteed minimum activity approach reduces to
the constant activity approach because when $L_{higher}$ becomes zero, the entire
input video needs to be processed.
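
The two length relationships can be checked numerically with a short Python sketch. It assumes that lengths are counted in frames and that each input frame has the same play time; the function names are illustrative only.

```python
def constant_activity_length(activities, target):
    """Output length for a constant target activity:
    L_output = (mean activity / target) * L_input."""
    mean_activity = sum(activities) / len(activities)
    return mean_activity / target * len(activities)


def guaranteed_minimum_length(activities, target):
    """Output length for a guaranteed minimum activity.

    Frames at or above the target keep their play time (L_higher); frames
    below it are shortened so that they reach the target:
    L_output = (mean lower activity / target) * L_lower + L_higher.
    """
    lower = [r for r in activities if r < target]
    l_higher = len(activities) - len(lower)
    if not lower:
        return float(l_higher)  # nothing is below the target
    mean_lower = sum(lower) / len(lower)
    return mean_lower / target * len(lower) + l_higher
```

With activities `[1, 1, 4, 4]` and a target of 2, the two low-activity frames shrink to one frame's worth of play time and the two high-activity frames are kept, giving a length of 3. With a target above every activity value, both functions return the same length, which matches the reduction to the constant activity case noted above.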

The guaranteed minimum motion activity method can now proceed as follows.
First, we assign actual motion activity values, in terms of a continuous
descriptor, to each level of motion activity. Second, we express the average
motion activity of the input video as a temporal histogram of the motion activity
as described in U.S. Patent Application Sn. 09/406,444 "Activity Descriptor for
Video Sequence" filed by Divakaran et al. on September 27, 1999, incorporated
herein by reference. The temporal histogram directly indicates what frames of
the input video have a level of motion activity that is lower than the targeted
activity in a quantized fashion so the above classification can be performed.
Third, we associate the temporal histogram with the actual motion values, and
apply the guaranteed minimum activity method as expressed in the above
formulations to determine the relationship between the length of the output
video and the targeted level of motion activity.

Processing of Video 

Figure 3 shows the steps involved in the generalized method for temporally
processing the input video. Step 310 optionally partitions the input video 301
into "shots" or segments 311 using known scene change detection techniques.
This is based on the observation that dominant characteristics are frequently
clustered by segments, shots, or scenes. Then, different feature extraction
techniques can be applied depending on the dominant characteristics of a
particular segment.

Step 320 measures selected characteristics 321, such as motion activity, color,
shape, etc., of the content of the frames of each of the segments 311 using any
of the methods as described above.

The measures 312 are used to classify the frames 315 of each of the
segments 311. The measures 312 can include the average 313, or other derived
statistical moments 314.

Step 330 temporally and adaptively allocates play time to each frame according
to the classification of the frames. The allocated play time can be determined by
selectively sampling (down- or up-sampling) the frames, or by varying the
frame rate. The allocation of play time can be constrained by user selected
allocation parameters 331 such as total play time for the output video 302,
constant level of motion activity, minimum level of motion activity, and the
like. It should be understood that varied allocation of play time by either
sampling or frame rates can be combined while composing the output video
302. It should be understood that the allocation of play time can be dynamically
varied by controls 333 which select a level of whatever the selected
characteristics are.
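
The measure, classify, and allocate steps can be sketched in Python as shown below. The `Frame` record, the threshold-based classification, and the per-class `speedups` table are assumptions introduced for this sketch; the text does not prescribe a particular classifier or allocation rule.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    index: int
    measure: float    # measured characteristic, e.g., motion activity
    play_time: float  # input frame play time in seconds


def classify(frame, thresholds):
    """Quantize the measure into a class label 0..len(thresholds).
    `thresholds` must be sorted in increasing order (assumed scheme)."""
    return sum(frame.measure >= t for t in thresholds)


def allocate_play_time(frames, thresholds, speedups):
    """Return (frame index, output play time) pairs.

    speedups[c] is the hypothetical speed-up factor applied to frames of
    class c; a factor above 1 shortens the play time, below 1 lengthens it.
    """
    if len(speedups) != len(thresholds) + 1:
        raise ValueError("need one speed-up factor per class")
    allocated = []
    for frame in frames:
        speedup = speedups[classify(frame, thresholds)]
        allocated.append((frame.index, frame.play_time / speedup))
    return allocated
```

For instance, with a single threshold of 1.0 and speed-up factors `[4.0, 1.0]`, frames below the threshold play four times faster and the remaining frames play at their normal rate. User-selected parameters such as a total play time could be met by adjusting the factors.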

Note, our invented technique is distinguished from prior art techniques that
directly vary play time, such as fast-forward and slow-motion. Those
techniques directly vary the frame rate independent of the content. In contrast,
we vary the desired level of a characteristic, e.g., motion activity or color,
and then indirectly vary the frame rate accordingly.

Processing Controlled by other Characteristics 

As stated above, the adaptive processing can be controlled by other
characteristics of the video. For example, the characteristics 321 can be a
dominant color or colors. For example, if the selected dominant color in the
frames is to be green, the video is sampled at a higher rate than when there is
little or no green in the video. This is useful in processing videos of sporting
events. The processed video can discard "crowd" scenes or commercials, and
then, only frames reflecting activities on the playing field are incorporated into
the output video.

If the dominant color is skin color, then only frames including people are
sampled. For example, if a frame has more than 25% skin color then the frame
is selected so that the output video is more likely to have scenes where people
are talking, see U.S. Patent 5,940,530 "Backlit scene and people scene
detecting method and apparatus and a gradation correction apparatus" issued
to Fukushima, et al. on August 17, 1999.
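
Selecting frames by the fraction of pixels that show a chosen color can be sketched in Python as follows. The pixel predicate `is_greenish` is a deliberately crude placeholder and an assumption of this sketch; a real skin-color or grass-color detector would replace it. Frames are assumed to be lists of rows of `(r, g, b)` tuples.

```python
def color_fraction(frame, is_target_color):
    """Fraction of the pixels in `frame` accepted by the predicate."""
    pixels = [pixel for row in frame for pixel in row]
    if not pixels:
        return 0.0
    return sum(1 for pixel in pixels if is_target_color(pixel)) / len(pixels)


def select_frames(frames, is_target_color, threshold=0.25):
    """Indices of frames where more than `threshold` of the pixels match."""
    return [
        index
        for index, frame in enumerate(frames)
        if color_fraction(frame, is_target_color) > threshold
    ]


def is_greenish(pixel):
    """Placeholder predicate (assumption): green is the strongest channel."""
    r, g, b = pixel
    return g > r and g > b
```

The default threshold of 0.25 mirrors the 25% figure in the example above; frames that only just reach the threshold are not selected because the comparison is strict.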

Object shape can also be used as a content characteristic. For example, frames
with a ball-like object can be selectively sampled to summarize a tennis match.
Note, MPEG-4 provides elementary bit streams on a per object basis. Trajectory
can be used to discard frames with predominantly linear motion, and keep
frames with a higher level of non-linear motion. Texture can be used to sample
frames with brick buildings, foliage, waves, or any other selected texture, see
Brodatz, "Textures - A Photographic Album for Artists and Designers," Dover,
NY 1966 for standard textures.

Frame Rate 

In theory, it is possible to play the video at any number of different frame or
sampling rates. However, the temporal Nyquist rate puts limits on how fast the
video can be played without becoming imperceptible to the viewer. A simple
way of visualizing this is with a video sequence illuminated by a light that is
strobed. When the frame rate is equal to the rate of strobing, the scene will
appear stationary. Thus, the maximum level of motion activity in a particular
segment of the video determines how fast the video can be played. Furthermore,
as the rate of sampling decreases (or the frame rate increases), the segments of
the video will be reduced to a set of "still" frames or a "slide show." Depending
on the content and the level of motion activity, a cross-over point can be
determined where it becomes more efficient to play the video segment as a slide
show rather than a "moving" video.

Applications

We have applied our invention to a number of videos with diverse contents. For
example, a video acquired from a surveillance camera directed at a highway
produces very satisfactory results. Segments of the video where there is very
little traffic are skipped over rapidly, to allow the viewer to focus on those
segments with significant traffic. The invention works equally well with videos
of sporting events, or news broadcasts.

Our invention is also useful for video browsing. The amount of video that is
now accessible is enormous. Our methods are well suited for local content, and
indispensable for browsing remote content, e.g., content accessed over the
Internet, because we enable a more efficient use of the limited available
bandwidth.

Our invention is extremely useful for surveillance applications. For example, a
set of surveillance cameras in a building can acquire many thousands of hours
of videos in a day or so. Normally, most of the videos will have a constant
characteristic, that is, a low level of motion activity or color/audio change, more
likely none at all. Only a small portion of the videos will record any significant
"security" events. Therefore, our invention allows a user to quickly access those
portions of the videos that warrant closer inspection.

We can also increase the efficacy of our methods by reducing the amount of
noise in the motion vectors. We can also combine various video characteristics,
such as motion activity and color, to refine the output video.

Although the invention has been described by way of examples of preferred
embodiments, it is to be understood that various other adaptations and
modifications may be made within the spirit and scope of the invention.
Therefore, it is the object of the appended claims to cover all such variations
and modifications as come within the true spirit and scope of the invention.

Claims

We claim:

1 LA method for temporally processing an input video including a plurality 

2 of input frames, each of the input frames having an associated input frame 

3 play time, and the input video having a total input video play time that is a 

4 sum of the input frame play times of all of the input frames, comprising: 

5 classifying each of the plurality of input frames according to a content 

6 characteristic of each of the input frames; and 

7 allocating an output frame play time to each of the plurality of input 

8 frames that is based on the classified content characteristic of each of the 

9 input frames to generate a plurality of output frames, 

2. The method of claim 1 wherein the content characteristic is based on
low-level features of each of the input frames.

3. The method of claim 1 wherein the low-level features are selected from a
group consisting of motion vectors, color, texture, brightness, contrast,
spectral parameters, local and global motion, activity, trajectory, speed,
acceleration, object shape, object size, number of objects, number of faces,
pitch, volume, tempo, and combinations thereof.

4. The method of claim 1 wherein the content characteristic is based on
high-level features of each of the input frames.

5. The method of claim 1 wherein the high-level features are selected from a
group consisting of genre, dramatic intensity, humor content, action level,
beauty, lyricism, musical intensity, educational quality, profundity, nudity,
linguistic class, and combinations thereof.

6. The method of claim 1 wherein the allocating of the play time is
dynamically varied while processing the video.

7. The method of claim 1 wherein the allocated output frame play time of
each of the output frames is determined by sampling the input frames.

8. The method of claim 7 wherein the sampling is a down-sampling of the
input frames.

9. The method of claim 7 wherein the sampling is an up-sampling of the
input frames.

10. The method of claim 9 wherein up-sampled output frames are
interpolated from the input frames.

11. The method of claim 7 wherein the sampling is a combination of
down-sampling and up-sampling of the input frames.

12. The method of claim 1 wherein the allocated output frame play time of
each of the output frames is determined by an output frame rate of the output
frame.

13. The method of claim 12 wherein the output frame rate is increased for
selected input frames.

14. The method of claim 12 wherein the output frame rate is decreased for
selected input frames.

15. The method of claim 1 further comprising:
    measuring the content characteristics of each of the plurality of input
frames to determine the classification.

16. The method of claim 15 further comprising:
    computing a statistical moment for the measured characteristics to
determine the classification.

17. The method of claim 1 wherein the allocation of play time is based on a
constant level of motion activity in the output video.

18. The method of claim 1 wherein the allocation of play time is based on a
guaranteed minimum level of activity in the output video.

19. The method of claim 1 further comprising:
    partitioning the input video into a plurality of segments, and
    processing the input video on a per segment basis.

20. The method of claim 1 wherein still frames are selected for the output
video when the allocated output frame play time exceeds a temporal Nyquist
limit.

21. The method of claim 1 further comprising:
    allocating a total output video play time; and
    allocating the output frame play times so that a sum of the output
frame play times of the plurality of output frames is equal to the total output
video play time of the output video.

22. The method of claim 1 wherein the allocated play time of a particular
frame can range on a continuum from zero time to a length of time of the
output video.

23. The method of claim 1 wherein the allocation of play time is based on a
motion activity in the output video, and a measure of motion activity is an
average of magnitudes of motion vectors of the frames.

24. The method of claim 23 where the average motion vector magnitude
$\hat{r}$ of the input video of N frames is expressed as:

$$\hat{r} = \frac{1}{N} \sum_{i=1}^{N} r_i$$

where an average motion vector magnitude of frame i is $r_i$.

25. The method of claim 24 wherein a relationship between a length
$L_{\text{output}}$ of the output video and a length $L_{\text{input}}$ of
the input video is expressed as

$$L_{\text{output}} = \frac{\hat{r}}{r_{\text{target}}} L_{\text{input}}$$

for a target level of motion activity $r_{\text{target}}$ in the output video.

26. The method of claim 25 further comprising:
    classifying all of frames j of the input video having the motion activity
equal to or higher than a targeted level of minimum motion activity into a
first set $S_{\text{higher}}$ having a length $L_{\text{higher}}$;
    classifying all of frames k of the input video having the motion
activity lower than the targeted level of minimum motion activity into a
second set $S_{\text{lower}}$ having a length $L_{\text{lower}}$;
    summing $L_{\text{higher}} + L_{\text{lower}}$ to determine a length
$L_{\text{input}}$ of the input video to determine a length of the output
video by

$$L_{\text{output}} = \frac{\hat{r}_{\text{lower}}}{r_{\text{target}}} L_{\text{lower}} + L_{\text{higher}}$$



27. A system for temporally processing an input video including a plurality
of input frames, each of the input frames having an associated input frame
play time, and the input video having a total input video play time that is a
sum of the input frame play times of all of the input frames, comprising:
    means for classifying each of the plurality of input frames according
to a content characteristic of each of the input frames;
    means for allocating a total output video play time; and
    means for allocating an output frame play time to each of the plurality
of input frames that is based on the classified content characteristic of each
of the input frames to generate a plurality of output frames so that a sum of
the output frame play times of the plurality of output frames is equal to the
total output video play time of the output video.
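
By way of illustration only, the length relationships recited in claims 24 to 26 can be sketched in Python as follows. The sketch assumes that the average activity is the mean of the per-frame values, that the output length scales with the ratio of average activity to target activity, and that every input frame has the same play time `frame_time`. The function names and that equal-play-time assumption are hypothetical and are not part of the claims.

```python
from typing import Sequence


def average_activity(r: Sequence[float]) -> float:
    """Mean of the per-frame average motion vector magnitudes r_i."""
    if not r:
        raise ValueError("need at least one frame")
    return sum(r) / len(r)


def constant_activity_length(r: Sequence[float], l_input: float,
                             r_target: float) -> float:
    """Output length when the whole video is played at activity r_target."""
    if r_target <= 0:
        raise ValueError("r_target must be positive")
    return average_activity(r) / r_target * l_input


def minimum_activity_length(r: Sequence[float], frame_time: float,
                            r_target: float) -> float:
    """Output length when only frames below r_target are sped up.

    Frames at or above r_target keep their play time (length L_higher);
    frames below it are compressed by (mean activity of lower set) / r_target.
    """
    if r_target <= 0:
        raise ValueError("r_target must be positive")
    higher = [x for x in r if x >= r_target]
    lower = [x for x in r if x < r_target]
    l_higher = len(higher) * frame_time
    l_lower = len(lower) * frame_time
    if not lower:
        return l_higher
    return average_activity(lower) / r_target * l_lower + l_higher


# Four frames of one time unit each, with illustrative activity values.
r = [2.0, 4.0, 6.0, 8.0]
whole_video = constant_activity_length(r, l_input=4.0, r_target=10.0)
guaranteed_minimum = minimum_activity_length(r, frame_time=1.0, r_target=4.0)
```

In this example the mean activity is 5.0, so a target of 10.0 halves the video, while a guaranteed minimum of 4.0 shortens only the single frame whose activity falls below that level.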

Abstract of the Disclosure

A system and method for temporally processing an input video including 
input frames. Each frame has an associated frame play time, and the input 
video has a total input video play time that is a sum of the input frame play 
times of all of the input frames. Each of the input frames is classified 
according to a content characteristic of each frame. An output frame play
time is allocated to each of the input frames that is based on the classified 
content characteristic of each of the input frames to generate a plurality of 
output frames that form an output video. 
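
As a hedged sketch of the allocation step summarized above, the following Python function gives each input frame an output play time proportional to its motion activity and scales the result so that the play times sum to a requested total. Proportional allocation is only one possible rule, chosen here because it keeps the perceived activity roughly constant; the function name `allocate_play_times` and its parameters are hypothetical.

```python
from typing import List, Sequence


def allocate_play_times(activity: Sequence[float],
                        input_times: Sequence[float],
                        total_output_time: float) -> List[float]:
    """Allocate an output play time to every input frame.

    Each frame's share is proportional to activity * input play time, so
    low-activity frames are played faster and high-activity frames slower.
    The shares are scaled to sum to total_output_time.
    """
    if len(activity) != len(input_times):
        raise ValueError("activity and input_times must have equal length")
    weights = [r * t for r, t in zip(activity, input_times)]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("no motion activity to distribute play time over")
    scale = total_output_time / total_weight
    return [w * scale for w in weights]


# Two frames of one time unit each; the second is three times as active.
output_times = allocate_play_times([1.0, 3.0], [1.0, 1.0], total_output_time=2.0)
```

A frame with zero activity receives zero play time under this rule, which corresponds to dropping it from the output video.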

DECLARATION AND POWER OF ATTORNEY



DECLARATION: 



As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are 
as stated below next to my name.

I believe the below named inventors are the original,
first and joint inventors of the subject matter which is 
claimed and for which a patent is sought on the invention 
for ADAPTIVELY PROCESSING A VIDEO BASED ON CONTENT 
CHARACTERISTICS OF FRAMES IN THE VIDEO, the specification of 
which is attached hereto unless the following box is 
checked. 

[_] was filed on ______ as Application Serial Number ______ and
was amended on ______ (if applicable).



I hereby state that I have reviewed and understand the 
contents of the above-identified specification, including 
the claims. 

I acknowledge the duty to disclose information which is 
material to patentability in accordance with Title 37, Code 
of Federal Regulations, §1.56. 

I hereby claim foreign priority benefits under Title 
35, United States Code, §119(a)-(d) of any foreign
application(s) for patent or inventor's certificate listed
below and have also identified below any foreign application 
for patent or inventor's certificate having a filing date 
before that of the application on which priority is claimed: 

I hereby claim the benefit under Title 35, United States
Code §119(e) of any United States Provisional application(s)
listed below.



APPLICATION NUMBER        FILING DATE
______                    ______



I hereby claim the benefit under Title 35, United States Code, 
§120 of any United States application(s) listed below and,
insofar as the subject matter of each of the claims of this
application is not disclosed in the prior United States application
in the manner provided by the first paragraph of Title 35,
United States Code, §112, I acknowledge the duty to disclose 
material information as defined in Title 37, Code of Federal 
Regulations, §1.56 which became available between the filing date 
of the prior application and the national or PCT international 
filing date of this application: 

Filing Date
PENDING

I hereby declare that all statements made of my own knowledge
are true and that all statements made on information and
belief are believed to be true; and further that these statements 
were made with the knowledge that willful false statements and 
the like so made are punishable by fine or imprisonment, or both, 
under Section 1001 of Title 18 of the United States Code and that 
such willful false statements may jeopardize the validity of the 
application or any patent issued thereon. 



POWER OF ATTORNEY: 

On behalf of Mitsubishi Electric Research Laboratories, Inc., 
Assignee of my entire right, title and interest, I hereby appoint 
the following attorney with full power of substitution to act 
exclusively for Mitsubishi Electric to prosecute this application 
and transact all business in the Patent and Trademark Office 
connected therewith: Dirk Brinkman, Reg. No. 35,450. 



All correspondence should be addressed to: 
Patent Department 

Mitsubishi Electric Research Laboratories, Inc. 
201 Broadway 

Cambridge, Massachusetts 02139 

All telephone calls should be directed to Dirk Brinkman,
telephone number (617) 621-7539. 